{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Preprocessing\n",
    "\n",
    "In this section, we learn how to preprocess data using [CSV.jl](https://csv.juliadata.org/stable/), [DataFrames.jl](https://dataframes.juliadata.org/stable/), [FeatureTransforms.jl](https://invenia.github.io/FeatureTransforms.jl/stable/), and [Statistics](https://docs.julialang.org/en/v1/stdlib/Statistics/#Statistics).\n",
    "\n",
    "## Reading the Dataset¶\n",
    "\n",
    "Comma-separated values (CSV) files are ubiquitous for storing tabular (spreadsheet-like) data. Here, each line corresponds to one record and consists of several (comma-separated) fields, e.g., “Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,Accomplishments in the field of gravitational physics”. To demonstrate how to load CSV files with CSV.jl, we create a CSV file below ../data/house_tiny.csv. This file represents a dataset of homes, where each row corresponds to a distinct home and the columns correspond to the number of rooms (NumRooms), the roof type (RoofType), and the price (Price)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"../data/house_tiny.csv\""
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using CSV\n",
    "\n",
    "csv_data = \"\"\"\n",
    "NumRooms,RoofType,Price\n",
    ",,127500\n",
    "2,,106000\n",
    "4,Slate,178100\n",
    ",,140000\n",
    "\"\"\"\n",
    "\n",
    "dir_path = joinpath(\"..\",\"data\")\n",
    "file_path = joinpath(dir_path,\"house_tiny.csv\")\n",
    "mkpath(dir_path)\n",
    "CSV.write(file_path,CSV.File(IOBuffer(csv_data)))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let’s import `DataFrames.jl` and load the dataset with `CSV.read`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><div style = \"float: left;\"><span>4×3 DataFrame</span></div><div style = \"clear: both;\"></div></div><div class = \"data-frame\" style = \"overflow-x: scroll;\"><table class = \"data-frame\" style = \"margin-bottom: 6px;\"><thead><tr class = \"header\"><th class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\">Row</th><th style = \"text-align: left;\">NumRooms</th><th style = \"text-align: left;\">RoofType</th><th style = \"text-align: left;\">Price</th></tr><tr class = \"subheader headerLastRow\"><th class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\"></th><th title = \"Union{Missing, Int64}\" style = \"text-align: left;\">Int64?</th><th title = \"Union{Missing, String7}\" style = \"text-align: left;\">String7?</th><th title = \"Int64\" style = \"text-align: left;\">Int64</th></tr></thead><tbody><tr><td class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\">1</td><td style = \"font-style: italic; text-align: right;\">missing</td><td style = \"font-style: italic; text-align: left;\">missing</td><td style = \"text-align: right;\">127500</td></tr><tr><td class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\">2</td><td style = \"text-align: right;\">2</td><td style = \"font-style: italic; text-align: left;\">missing</td><td style = \"text-align: right;\">106000</td></tr><tr><td class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\">3</td><td style = \"text-align: right;\">4</td><td style = \"text-align: left;\">Slate</td><td style = \"text-align: right;\">178100</td></tr><tr><td class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\">4</td><td style = \"font-style: italic; text-align: right;\">missing</td><td style = \"font-style: italic; text-align: left;\">missing</td><td style = \"text-align: right;\">140000</td></tr></tbody></table></div>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|ccc}\n",
       "\t& NumRooms & RoofType & Price\\\\\n",
       "\t\\hline\n",
       "\t& Int64? & String7? & Int64\\\\\n",
       "\t\\hline\n",
       "\t1 & \\emph{missing} & \\emph{missing} & 127500 \\\\\n",
       "\t2 & 2 & \\emph{missing} & 106000 \\\\\n",
       "\t3 & 4 & Slate & 178100 \\\\\n",
       "\t4 & \\emph{missing} & \\emph{missing} & 140000 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "\u001b[1m4×3 DataFrame\u001b[0m\n",
       "\u001b[1m Row \u001b[0m│\u001b[1m NumRooms \u001b[0m\u001b[1m RoofType \u001b[0m\u001b[1m Price  \u001b[0m\n",
       "     │\u001b[90m Int64?   \u001b[0m\u001b[90m String7? \u001b[0m\u001b[90m Int64  \u001b[0m\n",
       "─────┼────────────────────────────\n",
       "   1 │\u001b[90m  missing \u001b[0m\u001b[90m missing  \u001b[0m 127500\n",
       "   2 │        2 \u001b[90m missing  \u001b[0m 106000\n",
       "   3 │        4  Slate     178100\n",
       "   4 │\u001b[90m  missing \u001b[0m\u001b[90m missing  \u001b[0m 140000"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using DataFrames\n",
    "data = CSV.read(open(file_path),DataFrame)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Preparation\n",
    "\n",
    "In supervised learning, we train models to predict a designated target value, given some set of input values. Our first step in processing the dataset is to separate out columns corresponding to input versus target values. We can select columns either by name or via integer-location based indexing.\n",
    "\n",
    "You might have noticed that replaced all CSV entries with value with a special `missing` value. This can also happen whenever an entry is empty, e.g., “3,,,270000”. These are called missing values and they are the “bed bugs” of data science, a persistent menace that you will confront throughout your career. Depending upon the context, missing values might be handled either via *imputation* or *deletion*. Imputation replaces missing values with estimates of their values while deletion simply discards either those rows or those columns that contain missing values.\n",
    "\n",
    "Here are some common imputation heuristics. For categorical input fields, we can treat `missing` as a category. Since the RoofType column takes values `Slate` and `missing`, `FeatureTransforms.jl` can convert this column into two columns `RoofType_Slate` and `RoofType_missing`. A row whose roof type is `Slate` will set values of `RoofType_Slate` and `RoofType_missing` to `true` and `false`, respectively. The converse holds for a row with a missing `RoofType` value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><div style = \"float: left;\"><span>4×3 DataFrame</span></div><div style = \"clear: both;\"></div></div><div class = \"data-frame\" style = \"overflow-x: scroll;\"><table class = \"data-frame\" style = \"margin-bottom: 6px;\"><thead><tr class = \"header\"><th class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\">Row</th><th style = \"text-align: left;\">NumRooms</th><th style = \"text-align: left;\">RoofType_missing</th><th style = \"text-align: left;\">RoofType_Slate</th></tr><tr class = \"subheader headerLastRow\"><th class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\"></th><th title = \"Union{Missing, Int64}\" style = \"text-align: left;\">Int64?</th><th title = \"Bool\" style = \"text-align: left;\">Bool</th><th title = \"Bool\" style = \"text-align: left;\">Bool</th></tr></thead><tbody><tr><td class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\">1</td><td style = \"font-style: italic; text-align: right;\">missing</td><td style = \"text-align: right;\">true</td><td style = \"text-align: right;\">false</td></tr><tr><td class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\">2</td><td style = \"text-align: right;\">2</td><td style = \"text-align: right;\">true</td><td style = \"text-align: right;\">false</td></tr><tr><td class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\">3</td><td style = \"text-align: right;\">4</td><td style = \"text-align: right;\">false</td><td style = \"text-align: right;\">true</td></tr><tr><td class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\">4</td><td style = \"font-style: italic; text-align: right;\">missing</td><td style = \"text-align: right;\">true</td><td style = \"text-align: right;\">false</td></tr></tbody></table></div>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|ccc}\n",
       "\t& NumRooms & RoofType\\_missing & RoofType\\_Slate\\\\\n",
       "\t\\hline\n",
       "\t& Int64? & Bool & Bool\\\\\n",
       "\t\\hline\n",
       "\t1 & \\emph{missing} & 1 & 0 \\\\\n",
       "\t2 & 2 & 1 & 0 \\\\\n",
       "\t3 & 4 & 0 & 1 \\\\\n",
       "\t4 & \\emph{missing} & 1 & 0 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "\u001b[1m4×3 DataFrame\u001b[0m\n",
       "\u001b[1m Row \u001b[0m│\u001b[1m NumRooms \u001b[0m\u001b[1m RoofType_missing \u001b[0m\u001b[1m RoofType_Slate \u001b[0m\n",
       "     │\u001b[90m Int64?   \u001b[0m\u001b[90m Bool             \u001b[0m\u001b[90m Bool           \u001b[0m\n",
       "─────┼────────────────────────────────────────────\n",
       "   1 │\u001b[90m  missing \u001b[0m             true           false\n",
       "   2 │        2              true           false\n",
       "   3 │        4             false            true\n",
       "   4 │\u001b[90m  missing \u001b[0m             true           false"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using FeatureTransforms\n",
    "\n",
    "inputs, targets = data[:,1:2], data[:,3]\n",
    "\n",
    "# get all unique roof types\n",
    "roof_types = unique(inputs.RoofType)\n",
    "# one-hot encoding column RoofType, and append to original DataFrames\n",
    "inputs = FeatureTransforms.apply_append(inputs,OneHotEncoding(roof_types);cols=:RoofType,header=\"RoofType_\".*string.(roof_types))\n",
    "# drop column RoofType\n",
    "inputs = inputs[:,Not(:RoofType)]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For missing numerical values, one common heuristic is to replace the `missing` entries with the mean value of the corresponding column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><div style = \"float: left;\"><span>4×3 DataFrame</span></div><div style = \"clear: both;\"></div></div><div class = \"data-frame\" style = \"overflow-x: scroll;\"><table class = \"data-frame\" style = \"margin-bottom: 6px;\"><thead><tr class = \"header\"><th class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\">Row</th><th style = \"text-align: left;\">NumRooms</th><th style = \"text-align: left;\">RoofType_missing</th><th style = \"text-align: left;\">RoofType_Slate</th></tr><tr class = \"subheader headerLastRow\"><th class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\"></th><th title = \"Float64\" style = \"text-align: left;\">Float64</th><th title = \"Bool\" style = \"text-align: left;\">Bool</th><th title = \"Bool\" style = \"text-align: left;\">Bool</th></tr></thead><tbody><tr><td class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\">1</td><td style = \"text-align: right;\">3.0</td><td style = \"text-align: right;\">true</td><td style = \"text-align: right;\">false</td></tr><tr><td class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\">2</td><td style = \"text-align: right;\">2.0</td><td style = \"text-align: right;\">true</td><td style = \"text-align: right;\">false</td></tr><tr><td class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\">3</td><td style = \"text-align: right;\">4.0</td><td style = \"text-align: right;\">false</td><td style = \"text-align: right;\">true</td></tr><tr><td class = \"rowNumber\" style = \"font-weight: bold; text-align: right;\">4</td><td style = \"text-align: right;\">3.0</td><td style = \"text-align: right;\">true</td><td style = \"text-align: right;\">false</td></tr></tbody></table></div>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|ccc}\n",
       "\t& NumRooms & RoofType\\_missing & RoofType\\_Slate\\\\\n",
       "\t\\hline\n",
       "\t& Float64 & Bool & Bool\\\\\n",
       "\t\\hline\n",
       "\t1 & 3.0 & 1 & 0 \\\\\n",
       "\t2 & 2.0 & 1 & 0 \\\\\n",
       "\t3 & 4.0 & 0 & 1 \\\\\n",
       "\t4 & 3.0 & 1 & 0 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "\u001b[1m4×3 DataFrame\u001b[0m\n",
       "\u001b[1m Row \u001b[0m│\u001b[1m NumRooms \u001b[0m\u001b[1m RoofType_missing \u001b[0m\u001b[1m RoofType_Slate \u001b[0m\n",
       "     │\u001b[90m Float64  \u001b[0m\u001b[90m Bool             \u001b[0m\u001b[90m Bool           \u001b[0m\n",
       "─────┼────────────────────────────────────────────\n",
       "   1 │      3.0              true           false\n",
       "   2 │      2.0              true           false\n",
       "   3 │      4.0             false            true\n",
       "   4 │      3.0              true           false"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using Statistics\n",
    "inputs.NumRooms = replace(inputs.NumRooms, missing=>mean(skipmissing(inputs.NumRooms)))\n",
    "inputs"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conversion to the Matrix \n",
    "We can load `inputs` and `targets` into `Matrix`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4×3 Matrix{Float64}:\n",
       " 3.0  1.0  0.0\n",
       " 2.0  1.0  0.0\n",
       " 4.0  0.0  1.0\n",
       " 3.0  1.0  0.0"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "4-element Vector{Int64}:\n",
       " 127500\n",
       " 106000\n",
       " 178100\n",
       " 140000"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "X, y = Matrix(inputs), Vector(targets)\n",
    "display(X), display(y);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 1.9.0",
   "language": "julia",
   "name": "julia-1.9"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}